Abstract

We describe our experience with automatic align- 
ment of sentences in parallel English-Chinese 
texts. Our report concerns three related topics: 
(1) progress on the HKUST English-Chinese Par- 
allel Bilingual Corpus; (2) experiments addressing 
the applicability of Gale & Church's (1991) length- 
based statistical method to the task of align- 
ment involving a non-Indo-European language; 
and (3) an improved statistical method that also 
incorporates domain-specific lexical cues. 

INTRODUCTION 

Recently, a number of automatic techniques for 
aligning sentences in parallel bilingual corpora 
have been proposed (Kay & Roscheisen 1988; 
Catizone et al. 1989; Gale & Church 1991; Brown
et al. 1991; Chen 1993), and coarser approaches 
when sentences are difficult to identify have also 
been advanced (Church 1993; Dagan et al. 1993). 
Such corpora contain the same material that has 
been translated by human experts into two lan- 
guages. The goal of alignment is to identify match- 
ing sentences between the languages. Alignment is 
the first stage in extracting structural information 
and statistical parameters from bilingual corpora. 
The problem is made more difficult because a sen- 
tence in one language may correspond to multiple 
sentences in the other; worse yet, sometimes sev- 
eral sentences' content is distributed across multi- 
ple translated sentences. 

Approaches to alignment fall into two main 
classes: lexical and statistical. Lexically-based 
techniques use extensive online bilingual lexicons 
to match sentences. In contrast, statistical tech- 
niques require almost no prior knowledge and are 
based solely on the lengths of sentences. The 
empirical results to date suggest that statistical 
methods yield performance superior to that of cur- 
rently available lexical techniques. 

However, as far as we know, the literature on automatic alignment has been restricted to alphabetic Indo-European languages. This methodological flaw weakens the arguments in favor of either approach, since it is unclear to what extent a technique's superiority depends on the similarity between related languages. The work reported herein moves towards addressing this problem.^1

In this paper, we describe our experience 
with automatic alignment of sentences in paral- 
lel English-Chinese texts, which was performed as 
part of the SILC machine translation project. Our 
report concerns three related topics. In the first of 
the following sections, we describe the objectives 
of the HKUST English-Chinese Parallel Bilingual 
Corpus, and our progress. The subsequent sec- 
tions report experiments addressing the applica- 
bility of a suitably modified version of Gale & 
Church's (1991) length-based statistical method to 
the task of aligning English with Chinese. In the 
final section, we describe an improved statistical 
method that also permits domain-specific lexical 
cues to be incorporated probabilistically. 

THE ENGLISH-CHINESE 
CORPUS 

The dearth of work on non-Indo-European lan- 
guages can partly be attributed to a lack of the 
prerequisite bilingual corpora. As a step toward
remedying this, we are in the process of construct- 
ing a suitable English-Chinese corpus. To be in- 
cluded, materials must contain primarily tight, lit- 
eral sentence translations. This rules out most fic- 
tion and literary material. 

We have been concentrating on the Hong 
Kong Hansard, which are the parliamentary pro- 
ceedings of the Legislative Council (LegCo). Anal- 
ogously to the bilingual texts of the Canadian 
Hansard (Gale & Church 1991), LegCo transcripts are kept in full translation in both English and Cantonese.^2 However, unlike the Canadian Hansard, the Hong Kong Hansard has not previously been available in machine-readable form. We have obtained and converted these materials by special arrangement.

^1 Some newer methods are also intended to be applied to non-Indo-European languages in the future (Fung & Church 1994).

The materials contain high-quality literal 
translation. Statements in LegCo may be made 
using either English or Cantonese, and are tran- 
scribed in the original language. A translation to 
the other language is made later to yield com- 
plete parallel texts, with annotations specifying 
the source language used by each speaker. Most 
sentences are translated 1-for-1. A small proportion
are 1-for-2 or 2-for-2, and on rare occasion
1-for-3, 3-for-3, or other configurations. Samples
of the English and Chinese texts can be seen in
figures 3 and 4.^3

Because of the obscure format of the origi- 
nal data, it has been necessary to employ a sub- 
stantial amount of automatic conversion and ref- 
ormatting. Sentences are identified automatically 
using heuristics that depend on punctuation and 
spacing. Segmentation errors occur occasionally, 
due either to typographical errors in the original 
data, or to inadequacies of our automatic conver- 
sion heuristics. This simply results in incorrectly 
placed delimiters; it does not remove any text from 
the corpus. 

Although the emphasis is on clean text so 
that markup is minimal, paragraphs and sentences 
are marked following TEI-conformant SGML 
(Sperberg-McQueen & Burnard 1992). We use the 
term "sentence" in a generalized sense including 
lines in itemized lists, headings, and other non- 
sentential segments smaller than a paragraph. 

The corpus currently contains about 60Mb of 
raw data, of which we have been concentrating 
on approximately 3.2Mb. Of this, 2.1Mb is text 
comprised of approximately 0.35 million English 
words, with the corresponding Chinese translation 
occupying the remaining 1.1Mb. 

STATISTICALLY-BASED 
ALIGNMENT 

The statistical approach to alignment can be summarized as follows: choose the alignment that maximizes the probability over all possible alignments, given a pair of parallel texts. Formally, choose

$$\arg\max_{A} \Pr(A \mid T_1, T_2) \tag{1}$$

where $A$ is an alignment, and $T_1$ and $T_2$ are the English and Chinese texts, respectively. An alignment $A$ is a set consisting of $L_1 \rightleftharpoons L_2$ pairs where each $L_1$ or $L_2$ is an English or Chinese passage.

^2 Cantonese is one of the four major Han Chinese languages. Formal written Cantonese employs the same characters as Mandarin, with some additions. Though there are grammatical and usage differences between the Chinese languages, as between German and Swiss German, the written forms can be read by all.

^3 For further description see also Fung & Wu (1994).

This formulation is so extremely general that 
it is difficult to argue against its pure form. More 
controversial are the approximations that must be 
made to obtain a tractable version. 

The first commonly made approximation is 
that the probabilities of the individual aligned 
pairs within an alignment are independent, i.e., 

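$$\Pr(A \mid T_1, T_2) \approx \prod_{(L_1 \rightleftharpoons L_2) \in A} \Pr(L_1 \rightleftharpoons L_2 \mid T_1, T_2)$$

The other common approximation is that each $\Pr(L_1 \rightleftharpoons L_2 \mid T_1, T_2)$ depends not on the entire texts, but only on the contents of the specific passages within the alignment:

$$\Pr(A \mid T_1, T_2) \approx \prod_{(L_1 \rightleftharpoons L_2) \in A} \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2)$$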

Maximization of this approximation to the 
alignment probabilities is easily converted into a 
minimum-sum problem: 

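$$
\begin{aligned}
\arg\max_{A} \Pr(A \mid T_1, T_2)
  &\approx \arg\max_{A} \prod_{(L_1 \rightleftharpoons L_2) \in A} \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) \\
  &= \arg\min_{A} \sum_{(L_1 \rightleftharpoons L_2) \in A} -\log \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2)
\end{aligned}
\tag{2}
$$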

The minimization can be implemented using a dy- 
namic programming strategy. 
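As a rough illustration of the dynamic programming strategy, the following Python sketch minimizes the sum in equation (2) over groupings of up to two passages per side. The function names `align` and `toy_cost`, and the toy cost itself, are hypothetical stand-ins introduced here; any function returning $-\log \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2)$ could be supplied in its place.

```python
import math

# Allowed groupings (passages consumed from text 1, passages consumed from text 2).
# These mirror the six match classes used later in the paper.
MATCH_CLASSES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]


def align(src, tgt, cost):
    """Return the minimum-total-cost alignment as a list of (src_group, tgt_group)."""
    n, m = len(src), len(tgt)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            # Extend the best partial alignment ending at (i, j) by one group.
            for di, dj in MATCH_CLASSES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                total = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Trace the back-pointers from the end of both texts to the start.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


def toy_cost(src_group, tgt_group):
    """Hypothetical cost: length mismatch plus a fixed penalty for non-1-for-1 groups."""
    l1 = sum(len(s) for s in src_group)
    l2 = sum(len(t) for t in tgt_group)
    penalty = 0.0 if (len(src_group), len(tgt_group)) == (1, 1) else 3.0
    return abs(l1 - l2) + penalty


pairs = align(["aaaa", "bb", "cc", "dddddd"], ["AAAA", "BBCC", "DDDDDD"], toy_cost)
```

With this toy cost, the two short passages "bb" and "cc" are grouped 2-for-1 against "BBCC". The table has (n+1)(m+1) cells and a constant number of transitions per cell, so time and memory grow with the product of the two text lengths.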

Further approximations vary according to the 
specific method being used. Below, we first discuss 
a pure length-based approximation, then a method 
with lexical extensions. 

APPLICABILITY OF LENGTH- 
BASED METHODS TO CHINESE 

Length-based alignment methods are based on the 
following approximation to equation (2): 

$$\Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid l_1, l_2) \tag{3}$$

where $l_1 = \mathrm{length}(L_1)$ and $l_2 = \mathrm{length}(L_2)$, measured in number of characters. In other words, the only feature of $L_1$ and $L_2$ that affects their alignment probability is their length. Note that there are other length-based alignment methods that measure length in number of words instead of characters (Brown et al. 1991). However, since Chinese text consists of an unsegmented character stream without marked word boundaries, it would not be possible to count the number of words in a sentence without first parsing it.

Although it has been suggested that length- 
based methods are language-independent (Gale & 
Church 1991; Brown et al. 1991), they may in fact 
rely to some extent on length correlations arising 
from the historical relationships of the languages 
being aligned. If translated sentences share cog- 
nates, then the character lengths of those cognates 
are of course correlated. Grammatical similarities 
between related languages may also produce cor- 
relations in sentence lengths. 

Moreover, the combinatorics of non-Indo- 
European languages can depart greatly from Indo- 
European languages. In Chinese, the majority of 
words are just one or two characters long (though 
collocations up to four characters are also com- 
mon). At the same time, there are several thou- 
sand characters in daily use, as in conversation or 
newspaper text. Such lexical differences make it 
even less obvious whether pure sentence-length cri- 
teria are adequately discriminating for statistical 
alignment. 

Our first goal, therefore, is to test whether 
purely length-based alignment results can be repli- 
cated for English and Chinese, languages from 
unrelated families. However, before length-based 
methods can be applied to Chinese, it is first nec- 
essary to generalize the notion of "number of char- 
acters" to Chinese strings, because most Chinese 
text (including our corpus) includes occasional 
English proper names and abbreviations, as well 
as punctuation marks. Our approach is to count 
each Chinese character as having length 2, and 
each English or punctuation character as having 
length 1. This corresponds to the byte count for 
text stored in the hybrid English-Chinese encoding
system known as Big5.
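The generalized length measure can be sketched in a few lines of Python. The function name `mixed_length` is hypothetical, and the sketch assumes that every non-ASCII character is a two-byte Chinese (or full-width) character, which is a simplification of the byte-count rule described above.

```python
def mixed_length(text: str) -> int:
    """Length of a mixed English-Chinese passage.

    ASCII characters (English letters, digits, ASCII punctuation, spaces)
    count as 1; every other character is assumed to be Chinese and counts as 2.
    """
    return sum(1 if ord(ch) < 128 else 2 for ch in text)


# Two Chinese characters (2 + 2), one space (1), five ASCII letters (5).
example_length = mixed_length("香港 LegCo")
```

Under this assumption the result equals the number of bytes the passage would occupy in a two-byte Chinese encoding, without needing the text to be stored in that encoding.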

Gale & Church's (1991) length-based alignment method is based on the model that each English character in $L_1$ is responsible for generating some number of characters in $L_2$. This model leads to a further approximation which encapsulates the dependence to a single parameter $\delta$ that is a function of $l_1$ and $l_2$:

$$\Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid \delta(l_1, l_2))$$

However, it is much easier to estimate the distributions for the inverted form obtained by applying Bayes' Rule:

$$\Pr(L_1 \rightleftharpoons L_2 \mid \delta) = \frac{\Pr(\delta \mid L_1 \rightleftharpoons L_2)\,\Pr(L_1 \rightleftharpoons L_2)}{\Pr(\delta)}$$

where $\Pr(\delta)$ is a normalizing constant that can be ignored during minimization. The other two distributions are estimated as follows.

First we choose a function for $\delta(l_1, l_2)$. To do this we look at the relation between $l_1$ and $l_2$ under the generative model. Figure 1 shows a plot of English versus Chinese sentence lengths for a hand-aligned sample of 142 sentences. If the sentence lengths were perfectly correlated, the points would lie on a diagonal through the origin. We estimate the slope of this idealized diagonal $c = E(r) = E(l_2 / l_1)$ by averaging over the training corpus of hand-aligned $L_1 \rightleftharpoons L_2$ pairs, weighting by the length of $L_1$. In fact this plot displays substantially greater scatter than the English-French data of Gale & Church (1991).^4 The mean number of Chinese characters generated by each English character is $c = 0.506$, with a standard deviation $\sigma = 0.166$.
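The estimation of the slope can be sketched as follows. The function name `estimate_ratio` and the sample lengths are hypothetical; weighting the standard deviation by $l_1$ in the same way as the mean is an assumption, since the text only states the weighting for the mean.

```python
import math


def estimate_ratio(pairs):
    """Estimate c and sigma from hand-aligned (l1, l2) length pairs.

    c is the l1-weighted mean of r = l2 / l1; sigma is the l1-weighted
    standard deviation of r around c (an assumed convention).
    """
    pairs = list(pairs)
    total_l1 = sum(l1 for l1, _ in pairs)
    c = sum(l1 * (l2 / l1) for l1, l2 in pairs) / total_l1
    variance = sum(l1 * (l2 / l1 - c) ** 2 for l1, l2 in pairs) / total_l1
    return c, math.sqrt(variance)


# Made-up lengths for three hand-aligned pairs (not taken from the corpus).
c, sigma = estimate_ratio([(100, 50), (200, 110), (50, 20)])
```

Because the weights are the English lengths, the weighted mean reduces to the total Chinese length divided by the total English length, so long passages dominate the estimate.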

We now assume that $l_2 - l_1 c$ is normally distributed, following Gale & Church (1991), and transform it into a new gaussian variable of standard form (i.e., with mean 0 and variance 1) by appropriate normalization:

$$\delta(l_1, l_2) = \frac{l_2 - l_1 c}{\sqrt{l_1 \sigma^2}} \tag{4}$$

This is the quantity that we choose to define as $\delta(l_1, l_2)$. Consequently, for any two pairs in a proposed alignment, $\Pr(\delta \mid L_1 \rightleftharpoons L_2)$ can be estimated according to the gaussian assumption.
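A minimal sketch of this step, assuming the normalization $\delta = (l_2 - l_1 c)/\sqrt{l_1 \sigma^2}$ and assuming, as in Gale & Church (1991), that $\Pr(\delta \mid L_1 \rightleftharpoons L_2)$ is taken to be the two-sided tail probability of a standard normal variable. The function names are hypothetical; the constants are the values of $c$ and $\sigma$ reported above.

```python
import math

C = 0.506      # mean Chinese length generated per unit of English length
SIGMA = 0.166  # standard deviation reported for the hand-aligned sample


def delta(l1, l2, c=C, sigma=SIGMA):
    """Standardized length difference for a candidate pair (assumed form of eq. 4)."""
    return (l2 - l1 * c) / math.sqrt(l1 * sigma ** 2)


def prob_delta_given_match(d):
    """Two-sided standard normal tail, 2 * (1 - Phi(|d|)), computed via erfc."""
    return math.erfc(abs(d) / math.sqrt(2))


d = delta(100, 50)             # slightly shorter Chinese side than expected
p = prob_delta_given_match(d)  # close to 1 for small |d|
```

A pair whose Chinese length is exactly $l_1 c$ gets $\delta = 0$ and probability 1; the probability falls off as the lengths diverge from the expected ratio.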

To check how accurate the gaussian assump- 
tion is, we can use equation (4) to transform the 
same training points from figure 1 and produce a 
histogram. The result is shown in figure 2. Again, 
the distribution deviates from a gaussian distri- 
bution substantially more than Gale & Church 
(1991) report for French/German/English. More- 
over, the distribution does not resemble any 
smooth distribution at all, including the logarithmic
normal used by Brown et al. (1991), raising
doubts about the potential performance of pure 
length-based alignment. 

Continuing nevertheless, to estimate the other term $\Pr(L_1 \rightleftharpoons L_2)$, a prior over six classes is constructed, where the classes are defined by the number of passages included within $L_1$ and $L_2$. Table 1 shows the probabilities used. These probabilities are taken directly from Gale & Church (1991); slightly improved performance might be obtained by estimating these probabilities from our corpus.
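Combining the two terms gives the per-pair cost that the dynamic program sums. The sketch below is illustrative: `PRIORS` holds the six class priors of Table 1, while the function name `match_cost` and the floor used to avoid taking the logarithm of zero are assumptions introduced here.

```python
import math

# Priors for the six match classes, keyed by
# (passages in L1, passages in L2), as listed in Table 1.
PRIORS = {
    (0, 1): 0.0099,
    (1, 0): 0.0099,
    (1, 1): 0.89,
    (1, 2): 0.089,
    (2, 1): 0.089,
    (2, 2): 0.011,
}


def match_cost(p_delta, n_src, n_tgt):
    """-log of Pr(delta | match) * Pr(match); Pr(delta) is dropped as a constant."""
    prior = PRIORS[(n_src, n_tgt)]
    # Floor the likelihood so that a vanishing tail probability stays finite.
    return -(math.log(max(p_delta, 1e-300)) + math.log(prior))


cost_1_1 = match_cost(1.0, 1, 1)
cost_2_2 = match_cost(1.0, 2, 2)
```

For equal length evidence, a 2-for-2 grouping costs more than a 1-for-1 match because its prior is much smaller, which is what biases the search toward 1-for-1 alignments.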

The aligned results using this model were eval- 
uated by hand for the entire contents of a ran- 






^4 The difference is also partly due to the fact that Gale & Church (1991) plot paragraph lengths instead of sentence lengths. We have chosen to plot sentence lengths because that is what the algorithm is based on.



 1. MR FRED LI (in Cantonese):
 2. I would like to talk about public assistance.
 3. I notice from your address that under the Public Assistance Scheme, the basic rate of $825 a month for a single adult will be increased by 15% to $950 a month.
 4. However, do you know that the revised rate plus all other grants will give each recipient no more than $2000 a month? On average, each recipient will receive $1600 to $1700 a month.
 5. In view of Hong Kong's prosperity and high living cost, this figure is very ironical.
 6. May I have your views and that of the Government?
 7. Do you think that a comprehensive review should be conducted on the method of calculating public assistance?
 8. Since the basic rate is so low, it will still be far below the current level of living even if it is further increased by 20% to 30%. If no comprehensive review is carried out in this aspect, this "safety net" cannot provide any assistance at all for those who are really in need.
 9. I hope Mr Governor will give this question a serious response.
10. THE GOVERNOR:
11. It is not in any way to belittle the importance of the point that the Honourable Member has made to say that, when at the outset of our discussions I said that I did not think that the Government would be regarded for long as having been extravagant yesterday, I did not realize that the criticisms would begin quite as rapidly as they have.
12. The proposals that we make on public assistance, both the increase in scale rates, and the relaxation of the absence rule, are substantial steps forward in Hong Kong which will, I think, be very widely welcomed.
13. But I know that there will always be those who, I am sure for very good reason, will say you should have gone further, you should have done more.
14. Societies customarily make advances in social welfare because there are members of the community who develop that sort of case very often with eloquence and verve.

[English side only; the Chinese side of each aligned pair is not recoverable from the extracted text.]

Figure 3: A sample of length-based alignment output.



domly selected pair of English and Chinese files 
corresponding to a complete session, comprising 
506 English sentences and 505 Chinese sentences. 
Figure 3 shows an excerpt from this output. Most 
of the true 1-for-1 pairs are aligned correctly. In
(4), two English sentences are correctly aligned
with a single Chinese sentence. However, the English
sentences in (6, 7) are incorrectly aligned 1-for-1
instead of 2-for-1. Also, (11, 12) shows an example
of a 3-for-1, 1-for-1 sequence that the model
has no choice but to align as 2-for-2, 2-for-2.

Judging relative to a manual alignment of the English and Chinese files, a total of 86.4% of the true $L_1 \rightleftharpoons L_2$ pairs were correctly identified by the length-based method. However, many of the errors occurred within the introductory session header, whose format is domain-specific (discussed below). If the introduction is discarded, then the proportion of correctly aligned pairs rises to 95.2%, a respectable rate especially in view of the drastic inaccuracies in the distributions assumed. A detailed breakdown of the results is shown in Table 2. For reference, results reported for English/French generally fall between 96% and 98%. However, all of these numbers should be interpreted as highly domain dependent, with very small sample size.

Figure 1: English versus Chinese sentence lengths.

Figure 2: English versus Chinese sentence lengths.

The above rates are for Type I errors. The alternative measure of accuracy on Type II errors is useful for machine translation applications, where the objective is to extract only 1-for-1 sentence pairs, and to discard all others. In this case, we are interested in the proportion of 1-for-1 output pairs that are true 1-for-1 pairs. (In information retrieval terminology, this measures precision whereas the above measures recall.) In the test session, 438 1-for-1 pairs were output, of which 377, or 86.1%, were true matches. Again, however, by discarding the introduction, the accuracy rises to a surprising 96.3%.
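The two accuracy measures can be made concrete with a small sketch. The function name `precision_recall` and the representation of an aligned pair as a tuple of passage-index tuples are hypothetical choices; the toy pairs are made up, and only the final check uses counts reported above.

```python
def precision_recall(output_pairs, true_pairs):
    """Return (precision, recall) of an alignment against a manual reference.

    Each pair is a hashable description of one aligned group, e.g.
    ((english_indices), (chinese_indices)).
    """
    output_pairs, true_pairs = set(output_pairs), set(true_pairs)
    hits = len(output_pairs & true_pairs)
    return hits / len(output_pairs), hits / len(true_pairs)


# Toy reference with four 1-for-1 pairs; the aligner output gets two right.
reference = {((0,), (0,)), ((1,), (1,)), ((2,), (2,)), ((3,), (3,))}
output = {((0,), (0,)), ((1,), (1,)), ((2,), (3,))}
precision, recall = precision_recall(output, reference)

# The Type II figure quoted in the text: 377 true matches among 438 output pairs.
reported_precision = round(100 * 377 / 438, 1)
```

Recall corresponds to the Type I rate (true pairs that were found) and precision to the Type II rate (output pairs that are correct).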



Passages in L1   Passages in L2   Prior
      0                1          0.0099
      1                0          0.0099
      1                1          0.89
      1                2          0.089
      2                1          0.089
      2                2          0.011

Table 1: Priors for $\Pr(L_1 \rightleftharpoons L_2)$.



The introductory session header exemplifies 
a weakness of the pure length-based strategy, 
namely, its susceptibility to long stretches of pas- 
sages with roughly similar lengths. In our data 
this arises from the list of council members present 
and absent at each session (figure 4), but similar 
stretches can arise in many other domains. In such 
a situation, two slight perturbations may cause the 
entire stretch of passages between the perturba- 
tions to be misaligned. These perturbations can 
easily arise from a number of causes, including 
slight omissions or mismatches in the original parallel
texts, a 1-for-2 translation pair preceding or
following the stretch of passages, or errors in the 
heuristic segmentation preprocessing. Substantial 
penalties may occur at the beginning and ending 
boundaries of the misaligned region, where the 
perturbations lie, but the misalignment between 
those boundaries incurs little penalty, because the 
mismatched passages have apparently matching 
lengths. This problem is apparently exacerbated 
by the non-alphabetic nature of Chinese. Because 
Chinese text contains fewer characters, character 
length is a less discriminating feature, varying over 
a range of fewer possible discrete values than the 
corresponding English. The next section discusses 
a solution to this problem. 

In summary, we have found that the statisti- 
cal correlation of sentence lengths has a far greater 
variance for our English-Chinese materials than 
with the Indo-European materials used by Gale 
& Church (1991). Despite this, the pure length- 
based method performs surprisingly well, except 
for its weakness in handling long stretches of sen- 
tences with close lengths. 

STATISTICAL INCORPORATION 
OF LEXICAL CUES 

To obtain further improvement in alignment accu- 
racy requires matching the passages' lexical con- 
tent, rather than using pure length criteria. This 
is particularly relevant for the type of long mis- 
matched stretches described above. 

Previous work on alignment has employed either solely lexical or solely statistical length criteria. In contrast, we wish to incorporate lexical criteria without giving up the statistical approach, which provides a high baseline performance.

             1-1    1-2    2-1    2-2    1-3    3-1    3-3
Total        433     20     21      2      1      1      1
Correct      377     17     20      0      0      0      0
Incorrect     56      3      1      2      1      1      1
% Correct   87.1   85.0   95.2    0.0    0.0    0.0    0.0

Table 2: Detailed breakdown of length-based alignment results.

 1. English: THE DEPUTY PRESIDENT THE HONOURABLE JOHN JOSEPH SWAINE, C.B.E., Q.C., J.P.
    Chinese: [Chinese text], K.B.E., L.V.O., J.P.
 2. English: THE CHIEF SECRETARY THE HONOURABLE SIR DAVID ROBERT FORD, K.B.E., L.V.O., J.P.
    Chinese: [Chinese text], C.B.E., J.P.
 3. English: THE FINANCIAL SECRETARY THE HONOURABLE NATHANIEL WILLIAM HAMISH MACLEOD, C.B.E., J.P.
    Chinese: [Chinese text], C.M.G., J.P.
    (37 misaligned matchings omitted)
41. English: THE HONOURALBE MAN SAI-CHEONG
42. English: THE HONOURABLE STEVEN POON KWOK-LIM THE HONOURABLE HENRY TANG YING YEN, J.P.
43. English: THE HONOURABLE TIK CHI-YUEN
    Chinese (items 41-43, assignment to items lost in extraction): [Chinese text]; [Chinese text], J.P.

Figure 4: A sample of misalignment using pure length criteria.

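Our method replaces equation (3) with the following approximation:

$$\Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid l_1, l_2, v_1, w_1, \ldots, v_n, w_n)$$

where $v_i = \#\mathrm{occurrences}(\text{English cue}_i, L_1)$ and $w_i = \#\mathrm{occurrences}(\text{Chinese cue}_i, L_2)$. Again, the dependence is encapsulated within difference parameters $\delta_i$ as follows:

$$\Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid \delta_0(l_1, l_2), \delta_1(v_1, w_1), \ldots, \delta_n(v_n, w_n))$$

Bayes' Rule now yields

$$\Pr(L_1 \rightleftharpoons L_2 \mid \delta_0, \delta_1, \delta_2, \ldots, \delta_n) \propto \Pr(\delta_0, \delta_1, \ldots, \delta_n \mid L_1 \rightleftharpoons L_2)\,\Pr(L_1 \rightleftharpoons L_2)$$

The prior $\Pr(L_1 \rightleftharpoons L_2)$ is evaluated as before. We assume all $\delta_i$ values are approximately independent, giving

$$\Pr(\delta_0, \delta_1, \ldots, \delta_n \mid L_1 \rightleftharpoons L_2) \approx \prod_{i=0}^{n} \Pr(\delta_i \mid L_1 \rightleftharpoons L_2) \tag{5}$$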
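To make the cue counts concrete, the sketch below computes the occurrence counts for each cue pair and their differences. It assumes that each lexical difference parameter is simply the count difference between the English and Chinese cue occurrences, which the text suggests but does not spell out; the function name `cue_deltas`, the two-entry lexicon and the sample passages are made up for illustration.

```python
def cue_deltas(english, chinese, lexicon):
    """Return one count difference per (english_cue, chinese_cue) pair.

    Counts are plain case-sensitive substring counts, so a cue such as
    "March" would also match inside a longer word.
    """
    return [english.count(e_cue) - chinese.count(c_cue) for e_cue, c_cue in lexicon]


# Hypothetical two-entry lexicon and passages (not taken from the corpus).
lexicon = [("J.P.", "J.P."), ("March", "三月")]
matched = cue_deltas("THE HONOURABLE HENRY TANG, J.P.", "唐議員, J.P.", lexicon)
mismatched = cue_deltas("Sitting of 3 March, J.P.", "三月三日", lexicon)
```

A correctly matched pair yields differences of zero for every cue, while a pair in which an honorific appears on only one side yields a nonzero difference for that cue.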



The same dynamic programming optimization 
can then be used. However, the computation and 
memory costs grow linearly with the number of 
lexical cues. This may not seem expensive until 
one considers that the pure length-based method 
only uses resources equivalent to that of a single 
lexical cue. It is in fact important to choose as 
few lexical cues as possible to achieve the desired 
accuracy. 

Given the need to minimize the number of lex- 
ical cues chosen, two factors become important. 
First, a lexical cue should be highly reliable, so 
that violations, which waste the additional com- 
putation, happen only rarely. Second, the chosen 
lexical cues should occur frequently, since comput- 
ing the optimization over many zero counts is not 
useful. In general, these factors are quite domain-specific,
so lexical cues must be chosen for the particular
corpus at hand. Note further that when
these conditions are met, the exact probability distribution
for the lexical $\delta_i$ parameters does not
have much influence on the preferred alignment.

The bilingual correspondence lexicons we have employed are shown in figure 5. These lexical items are quite common in the LegCo domain. Items like "C.B.E." stand for honorific titles such as "Commander of the British Empire"; the other cues are self-explanatory. The cues nearly always appear 1-to-1 and the differences $\delta_i$ therefore have a mean of zero. Given the relative unimportance of the exact distributions, all were simply assumed to be normally distributed with a variance of 0.07 instead of sampling each parameter individually. This variance is fairly sharp, but nonetheless, conservatively reflects a lower reliability than most of the cues actually possess.

Paragraph alignment:

English      Chinese
governor     [Chinese cue not recoverable from the extracted text]

Sentence alignment:

English      Chinese
C.B.E.       C.B.E.
C.M.G.       C.M.G.
I.S.O.       I.S.O.
J.B.E.       J.B.E.
J.P.         J.P.
K.B.E.       K.B.E.
L.V.O.       L.V.O.
O.B.E.       O.B.E.
M.B.E.       M.B.E.
Q.C.         Q.C.
January      一月
February     二月
March        三月
April        四月
May          五月
June         六月
July         七月
August       八月
September    九月
October      十月
November     十一月
December     十二月
Monday       星期一
Tuesday      星期二
Wednesday    星期三
Thursday     星期四
Friday       星期五
Saturday     星期六
Sunday       星期日

Figure 5: Lexicons employed for paragraph (top) and sentence (bottom) alignment.
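One way to turn the zero-mean, variance-0.07 assumption into costs is sketched below. It is an illustration only: scoring each cue difference by a two-sided normal tail probability (mirroring the treatment assumed for the length parameter in Gale & Church 1991) is an assumption, as are the function names and the floor that keeps the logarithm finite.

```python
import math

CUE_VARIANCE = 0.07  # variance assumed for every lexical difference parameter


def cue_cost(d, variance=CUE_VARIANCE):
    """-log of the two-sided tail probability of N(0, variance) at difference d."""
    p = math.erfc(abs(d) / math.sqrt(2 * variance))
    return -math.log(max(p, 1e-300))


def lexical_cost(deltas, variance=CUE_VARIANCE):
    """Sum of per-cue costs, following the independence assumption of equation (5)."""
    return sum(cue_cost(d, variance) for d in deltas)


agreeing = lexical_cost([0, 0, 0])     # every cue count matches on both sides
one_violation = lexical_cost([0, 1, 0])  # one cue appears on only one side
```

Under these assumptions a pair whose cue counts all agree incurs no lexical cost, while a single unmatched cue adds a penalty large enough to outweigh a modest length mismatch. The work per candidate pair grows linearly with the number of cues, in line with the cost remark above.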

Using the lexical cue extensions, the Type I results on the same test file rise to 92.1% of true $L_1 \rightleftharpoons L_2$ pairs correctly identified, as compared to 86.4% for the pure length-based method. The improvement is entirely in the introductory session header. Without the header, the rate is 95.0% as compared to 95.2% earlier (the discrepancy is insignificant and is due to somewhat arbitrary decisions made on anomalous regions). Again, caution should be exercised in interpreting these percentages.

By the alternative Type II measure, 96.1% of the output 1-for-1 pairs were true matches, compared to 86.1% using the pure length-based method. Again, there is an insignificant drop when the header is discarded, in this case from 96.3% down to 95.8%.

CONCLUSION 

Of our raw corpus data, we have currently aligned 
approximately 3.5Mb of combined English and 
Chinese texts. This has yielded 10,423 pairs classified
as 1-for-1, which we are using to extract
more refined information. This data represents 
over 0.217 million English words (about 1.269Mb) 
plus the corresponding Chinese text (0.659Mb). 

To our knowledge, this is the first large-scale 
empirical demonstration that a pure length-based 
method can yield high accuracy sentence align- 
ments between parallel texts in Indo-European 
and entirely dissimilar non-alphabetic, non-Indo- 
European languages. We are encouraged by the 



results and plan to expand our program in this 
direction. 

We have also obtained highly promising im- 
provements by hybridizing lexical and length- 
based alignment methods within a common sta- 
tistical framework. Though they are particularly 
useful for non-alphabetic languages where charac- 
ter length is not as discriminating a feature, we be- 
lieve improvements will result even when applied 
to alphabetic languages. 